Advanced topic 7, 2012
'''Large-Scale Network-Service Disruption: Dependencies and External Factors'''

'''Introduction'''

The rapid development of Internet technology has made large amounts of information available, which brings convenience but also ties the virtual world closely to the physical one. Network-service disruption, i.e., network outage, has therefore become one of the most important issues. Although a number of previous studies focus on natural and man-made disasters, the dependencies between network disruptions and other entities remain poorly studied and understood. This paper gives an overview of large-scale external disturbances to communication services and applies a series of methods to evaluate the dependencies between communication-service disruptions and external factors. The study covers two aspects, network and external factors; the latter concerns the dependencies of communication disruptions on external factors.

The paper is organized into five sections:

- Sections I and II: communication disruptions and their dependencies, studied using publicly available data on networks and organizations.
- Section III: whether network disruptions relate to the occurrence of a hurricane, using storm data from the National Hurricane Center and subnet geo-locations from GeoIP.
- Section IV: seeking root causes directly from the organizations that own the subnets.
- Section V: a discussion of the research issues of information access and sharing, to raise awareness across interdisciplinary communities such as communication networks, power, and meteorology.

'''Network Disruption and Dependency'''

A. ''Network Unreachability''

We use BGP to infer subnet unreachability. When a subnet becomes unreachable, BGP routers withdraw routes to the subnet by sending multiple withdrawals to each other [1]. When the subnet becomes reachable again, multiple announcements are sent among BGP routers to establish new routes.
To infer subnet unreachability, we collect BGP update messages for 230 subnets in the Texas area from Route Views during September 10-20, 2008, a period that runs from the evacuation announcement to one week after the landfall of Hurricane Ike [2]. We also collect BGP updates from August 1 to September 9 to provide a baseline of subnet unreachability during normal operations.

Unreachable subnets are identified in two steps:

1. Use a hierarchical clustering algorithm [3] to reduce the 230-dimensional time series to a small number of clusters. Each cluster contains subnets with a similar pattern of BGP updates.
2. Infer the unreachable status of a representative subnet from each cluster from its BGP updates as follows:
- Identifying bursts of update messages: if the inter-arrival time between two BGP updates is smaller than a threshold, the two updates are grouped into the same burst.
- Identifying unreachable subnets and their unreachable durations: within each burst, if all prior connectivity from the BGP routers is withdrawn, the subnet is considered unreachable. The time instance when this unreachability first occurs is the "initial time" of subnet unreachability. The unreachable duration is the period between the initial time and the next BGP announcement [4].

Figure 1 shows the inferred unreachable subnets in the chosen period. Among them, 120 subnets became unreachable in an intense period from 10:00 a.m. September 12 to 10:00 a.m. September 13.

B. ''Temporal Independence''

We first study temporal dependence, i.e., whether subnets became unreachable in a statistically dependent manner in time. There are two types of unreachable subnets: one consists of subnets that became unreachable at a small time scale; the other consists of subnets that became unreachable at a large time scale.
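The burst rule and the unreachability rule from step 2 of Section A can be illustrated with a minimal Python sketch. It is only one possible reading of the description, not the paper's implementation. The `Update` record, the function names, the 60-second threshold, and the sample updates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Update:
    time: float  # seconds since the start of the observation window
    peer: str    # BGP router that sent the update
    kind: str    # "W" for withdrawal, "A" for announcement


def group_into_bursts(updates, threshold):
    """Group updates so that consecutive updates closer than `threshold` share a burst."""
    bursts = []
    for update in sorted(updates, key=lambda u: u.time):
        if bursts and update.time - bursts[-1][-1].time < threshold:
            bursts[-1].append(update)
        else:
            bursts.append([update])
    return bursts


def initial_unreachable_time(burst, peers_with_route):
    """Return the time at which every peer that had a route has withdrawn it, else None."""
    remaining = set(peers_with_route)
    for update in burst:
        if update.kind == "W":
            remaining.discard(update.peer)
            if not remaining:
                return update.time
        else:
            remaining.add(update.peer)
    return None


def unreachable_duration(initial_time, updates):
    """Return the time from `initial_time` to the next announcement, or None if there is none."""
    later = [u.time for u in updates if u.kind == "A" and u.time > initial_time]
    return min(later) - initial_time if later else None


# Hypothetical updates for one subnet, seen from two peers (assumed data).
updates = [
    Update(0.0, "r1", "W"),
    Update(20.0, "r2", "W"),
    Update(5000.0, "r1", "A"),
    Update(5010.0, "r2", "A"),
]
bursts = group_into_bursts(updates, threshold=60.0)  # threshold is an assumption
start = initial_unreachable_time(bursts[0], peers_with_route={"r1", "r2"})
print(len(bursts), start, unreachable_duration(start, updates))  # prints: 2 20.0 4980.0
```

In this toy input the two withdrawals fall into one burst, the subnet is marked unreachable when the second peer withdraws, and the duration runs until the first later announcement.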
For those at the small time scale, each group is represented by a unique initial time, taken as the time when the last subnet in the group became unreachable. We obtain 56 groups, each with a unique initial time, presented at the bottom of Figure 2. The unique initial times in Figure 2 occurred in three disjoint intervals: (1) gradually from 10:16 a.m. to 8:25 p.m. September 12, (2) intensely between 8:25 p.m. September 12 and 3:00 a.m. September 13, and (3) less frequently from 3:00 a.m. to 10:00 a.m. September 13. The unique time epochs occur statistically independently, with different average inter-unreachable times in different intervals, according to a non-uniform Poisson point process.

C. ''Temporal and Network Dependence''

We now examine subnets that became unreachable in groups to understand the temporal and logical dependence of subnet unreachability. Figure 3 shows the dependence of unreachable subnets, organizations, and ASes in a hierarchy of the logical network space. The first row shows 22 organizations that each have their own independently unreachable subnets.

(1) Within-Organization Dependence

We examine each group to determine whether, and which, subnets from the same group became unreachable dependently. We begin by identifying unreachable subnets from the same organization. If two subnets from the same group and the same organization exhibit a similar pattern of withdrawal bursts, the two subnets are dependent within an organization (Table I presents an example). To determine whether two BGP withdrawal bursts exhibit a similar pattern, we compute the correlation coefficient of the inter-arrival times of BGP updates inside the bursts, sent from the same BGP routers for the two subnets. We find 18 groups whose correlation coefficients of BGP bursts are between 0.9150 and 1.000. These 18 groups correspond to 77 within-organization dependent subnets, 14 organizations, and 14 ASes, and are shown in the last three rows of Figure 3.
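As an illustration of the burst comparison described above, the following Python sketch computes a Pearson correlation coefficient between the inter-arrival times of two bursts. It assumes the two bursts contain the same number of updates and that updates are paired by their order within the burst. The paper may pair them differently, and the timestamps below are hypothetical.

```python
import math


def inter_arrival_times(times):
    """Return the gaps between consecutive update times in one burst."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def correlation_coefficient(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical withdrawal times (seconds) from the same router for two subnets.
burst_subnet_a = [0.0, 1.0, 3.0, 6.0, 10.0]
burst_subnet_b = [100.0, 101.5, 104.5, 109.0, 115.0]
r = correlation_coefficient(
    inter_arrival_times(burst_subnet_a),
    inter_arrival_times(burst_subnet_b),
)
print(round(r, 4))
```

Here the second burst has the same timing pattern stretched by a constant factor, so the coefficient is essentially 1, which would place the pair inside the 0.9150-1.000 range reported for dependent subnets.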
Our result shows that organization is a logical variable whose subnets can become unreachable dependently.

(2) Cross-Organization and Within-AS Dependence

We study the remaining groups, which contain subnets from different organizations but the same AS. There are four groups of cross-organization dependent subnets whose correlation coefficients of BGP bursts are between 0.9829 and 1.000. The subnets from these four groups belong to five organizations and three ASes, as shown in the last row of Figure 3. These findings show that AS is another logical variable, one that characterizes subnets from different organizations becoming unreachable dependently.

D. ''Comparison with Normal Operations''

We validate our findings by comparing them with subnet unreachability in the pre-Ike interval, when no major network disruptions were reported [5]. We find that the one-day Ike period from 10:00 a.m. September 12 to 10:00 a.m. September 13 indeed presents anomalous subnet unreachability, whereas the other days appear normal. Table II compares unreachable subnets between the pre-Ike and Ike periods. Compared to the pre-Ike period, the unreachabilities during Ike are 158.54 times more in number, occur at a 40.05 times higher rate, and last 7.09 days longer. Moreover, 183.33 times more unreachable subnets during Ike are from small organizations at the network edge, showing the impact of external disturbances.

Our results tell which subnets were unreachable from BGP routers, where unreachability corresponds to two scenarios. The first is when a subnet was disconnected from the rest of the network. The second is when the subnet maintained connectivity to some parts of the network but could not be reached from BGP routers, possibly due to other disconnectivity in the network [15]. BGP update messages can therefore be used to infer the reachability of prefixes, but they do not provide sufficient information to distinguish between these two cases.
'''External Factor: Correlating with Storm'''

A. ''Storm Data''

We obtain hourly observation data of Hurricane Ike from the public and forecast advisories [7] and the best track data [8].

- Duration: 10:00 a.m. September 12 to 10:00 a.m. September 13.
- Storm data: latitude and longitude coordinates of the storm, storm speed, and wind radii of hurricane-force winds (more than 74 miles per hour).

The storm center and the coverage moved at 12-13 miles per hour [7]. Since the subnet unreachabilities occurred at the scale of minutes, we reconstruct the storm path (the trajectory of the hurricane center) and the coverage (the area of the hurricane surface spanned by the wind radii of hurricane-force winds) at the scale of 15 minutes.

B. ''Geo-Location of Subnets''

Geo-locations of subnets are needed to relate subnet unreachability to the hurricane. We adopt geo-location data provided by GeoIP City from Maxmind [9], since Maxmind specifies the accuracy of its measurements. Specifically, Maxmind provides latitude and longitude locations of IP addresses and reports that approximately 79% of its geo-locations are "correctly resolved within 25 miles from a true location" [9]. We obtain the geo-location of every IP address in each prefix (e.g., 256 locations for a /24 subnet) from GeoIP. To account for the inaccuracy of GeoIP, each geo-location then becomes the center of a disk with a 25-mile radius within which the true location lies. As a result, we obtain 120 subnet regions.

C. ''Relating Network Unreachability and Storm''

We now use the reconstructed storm coverage and the subnet geo-locations to study how subnet unreachability relates to the storm. As the geo-location data is inaccurate, we define a coverage probability to infer when a subnet was in the storm coverage. Define:

1. Subnet $i$ with region $S_i$ ($1 \le i \le 120$).
2. $R_t$: the storm coverage at time $t$, consisting of a hurricane center and wind radii, for $t \in$ [10:00 a.m. September 12, 10:00 a.m. September 13].
3. $p(i,t) = |S_i \cap R_t| / |S_i|$: the coverage probability that the storm appears at the location of subnet $i$ at time $t$, where $|\cdot|$ denotes the area.

Figure 4 shows the color-coded values of the coverage probability for the 120 unreachable subnets. Each subnet corresponds to one horizontal line in time. The coverage probability starts to increase at 6:15 p.m. September 12 until $p(i,t)$ reaches a maximum value, and then decreases as the storm moves out of subnet region $S_i$, $1 \le i \le 120$.

We then define the "hitting time" as the time when the storm coverage first overlaps with a subnet region. The empirical distribution of the hitting times is obtained by projecting these hitting-time epochs onto the top part of Figure 4. The empirical distribution of the initial times when the subnets first became unreachable is shown in the middle of Figure 4 for comparison. The initial times span a longer duration than the hitting times, from 10:16 a.m. September 12 to 9:20 a.m. September 13, with the maximum between 0:30 and 1:15 a.m. September 13, which was 4.5 hours later than the maximum of the hitting times. This implies that the majority of the subnet unreachabilities occurred after the storm arrived at the subnet regions. However, some subnets became unreachable before the arrival of the storm. We discuss this further in Section III-E.

D. ''Correlating Network Unreachability with Storm''

We now correlate the initial time of unreachability with the hitting time for individual subnets. Define:

1. $T_{hi}$: the hitting time of subnet $i$, where $1 \le i \le 102$.
2. $T_{ui}$: the initial time of subnet $i$, where $1 \le i \le 102$.
3. $I_i(t)$ and $I_{hi}(t)$: two indicator functions for subnet $i$. $I_i(t) = 1$ if $t = T_{ui}$, i.e., when subnet $i$ initially became unreachable, and 0 otherwise. $I_{hi}(t) = 1$ if $t \in T_{hi}$, i.e., when $t$ falls in the interval during which the storm hit subnet region $i$, and 0 otherwise.
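To make the coverage probability and the hitting time concrete, the following Python sketch computes p(i,t) as the overlap area of two disks divided by the area of the subnet disk, and then finds the first 15-minute step at which the overlap is positive. It rests on several assumptions that are not in the paper: the storm coverage is simplified to a single disk, positions are planar coordinates in miles, and the storm track, the 50-mile storm radius, and the initial time are hypothetical values.

```python
import math


def disk_overlap_area(distance, r1, r2):
    """Area of the intersection of two disks whose centers are `distance` apart."""
    if distance >= r1 + r2:
        return 0.0  # disjoint disks
    if distance <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # one disk lies inside the other

    def clamp(value):
        # Guard acos against rounding slightly outside [-1, 1].
        return max(-1.0, min(1.0, value))

    angle1 = math.acos(clamp((distance**2 + r1**2 - r2**2) / (2 * distance * r1)))
    angle2 = math.acos(clamp((distance**2 + r2**2 - r1**2) / (2 * distance * r2)))
    triangle = 0.5 * math.sqrt(
        (-distance + r1 + r2) * (distance + r1 - r2)
        * (distance - r1 + r2) * (distance + r1 + r2)
    )
    return r1**2 * angle1 + r2**2 * angle2 - triangle


def coverage_probability(subnet_center, storm_center, storm_radius, subnet_radius=25.0):
    """p(i,t) = |S_i ∩ R_t| / |S_i| with both regions modeled as disks (an assumption)."""
    distance = math.dist(subnet_center, storm_center)
    overlap = disk_overlap_area(distance, subnet_radius, storm_radius)
    return overlap / (math.pi * subnet_radius**2)


def hitting_time(subnet_center, storm_track, storm_radius, subnet_radius=25.0):
    """First time in `storm_track` at which the storm overlaps the subnet region, else None."""
    for t, storm_center in storm_track:
        if coverage_probability(subnet_center, storm_center, storm_radius, subnet_radius) > 0:
            return t
    return None


# Hypothetical track: 15-minute steps over 24 hours, storm moving at 12.5 mph along x.
track = [(k * 0.25, (-200.0 + 12.5 * k * 0.25, 0.0)) for k in range(97)]
t_hit = hitting_time((0.0, 0.0), track, storm_radius=50.0)
t_unreachable = 14.0  # hypothetical initial time of unreachability, in hours
print(t_hit, t_unreachable - t_hit)
```

A positive difference between the initial time and the hitting time means the subnet became unreachable after the storm reached its region; a negative one corresponds to the non-causal case discussed in Section III-E.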
The sample correlation function between the hitting times and the initial times is computed from these indicator functions. The findings suggest that the storm may not be the only direct cause of subnet unreachability, and that there may exist hidden variables that are not inferable from the data. This motivates us to seek the actual root causes of subnet unreachability in Section IV.

E. ''Non-Causality of Subnet Unreachability''

Figure 4 shows that subnets could become unreachable before the storm approached, i.e., the initial time is earlier than the hitting time. Such a phenomenon corresponds to the non-causality of the unreachable subnets. These observations suggest that there may exist hidden variables that are neither taken into consideration nor inferable from our data. Hence, it is necessary to seek root causes directly from the organizations that experienced the network disruptions.

'''External Factor: Seeking Root Causes'''

We sought the root causes by actively contacting the organizations that own the unreachable subnets. All 40 organizations were contacted through extensive field work, e.g., emails, surface mail, and phone calls. Four organizations offered the root causes.

1. ''Houston Advanced Research Center (HARC)'': Their subnet became unreachable nine hours before the storm coverage arrived and remained unreachable for more than a week. The network administrators shut down the network as a precaution due to a lack of spare power. In addition, HARC experienced power outages starting at 7:45 a.m. September 13, after Ike's landfall. The power outages lasted until September 18, and their ISP experienced an unknown network connectivity problem on September 19. There was no physical network damage caused by the hurricane. This illustrates how power outages could strongly impact network reachability.
2. ''X'', an ISP in Texas: Using GeoIP geo-locations, the storm coverage never appeared in six subnets, and the other four subnets became unreachable between 2.66 and 6.41 hours after the storm coverage appeared in their subnet regions. The root causes were reported to be power outages.
3. ''NASA-Johnson Space Center'': Using GeoIP geo-locations, one subnet had no storm coverage, another subnet became unreachable 4.19 hours before the storm coverage, and the remaining subnets became unreachable between 3.04 and 8.18 hours after the storm coverage. The network administrators acknowledged that the root cause of some of their unreachable subnets was power outage.
4. ''University of Texas Medical Branch at Galveston (UTMB)'': Using the geo-location from UTMB, the subnet became unreachable about 1.85 hours after it appeared in the storm coverage. The network experienced power failures when rising water flooded the power generators. The network infrastructure itself was not damaged.

In summary, three root causes were power outages, and one root cause was the lack of spare power. The communication networks themselves experienced little damage for the four organizations. One subnet was verified to have become unreachable several hours before the storm coverage. Hence, power is another external factor: power outages caused service disruptions in communication.

'''Conclusion and Future Direction'''

In summary, 32 subnets in our data set were found to become unreachable independently in time. Eighty-eight subnets became unreachable dependently in time, of which 77 are from the same organization and 11 are from the same AS but different organizations. Power outages were reported to be one type of root cause. Lack of spare power was reported to be the cause of a non-causal unreachable subnet.

1. The first research issue that emerged is information acquisition across communication and power infrastructures as well as weather.
2. The second research issue is information sharing across multiple organizations and providers. Organizations possess pertinent information on geo-locations and root causes.

Suggestions on this topic:

1. Increase the mobility or backup of network equipment (switches, routers, base stations, databases, etc.) to reduce the effect of external factors (security).
2. Conduct further research on optimizing the network structure in different organizations.
3. Establish a plan for service-failure recovery (configuration).

'''References'''

[1] Y. Rekhter, T. Li, and S. Hares, "A Border Gateway Protocol 4." RFC 1771, 1995.
[2] Route Views project. Available: http://www.routeview.org.
[3] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[4] P. Cheng, X. Zhao, B. Zhang, and L. Zhang, "Longitudinal study of BGP monitor session failures," Comput. Commun. Rev., 2010.
[5] The NANOG mailing list archives - Aug. 2008. Available: http://mailman.nanog.org/pipermail/nanog/2008-August/thread.html.
[6] M. Caesar, L. Subramanian, and R. Katz, "Towards localizing root causes of BGP dynamics," Technical Report CSD-03-1292, Computer Science Department, University of California-Berkeley, Nov. 2003.
[7] National Hurricane Center - Hurricane Ike advisory archive. Available: http://www.nhc.noaa.gov/archive/2008/IKE.shtml.
[8] National Hurricane Center - A digital record of the complete best track data. Available: ftp://ftp.nhc.noaa.gov/atcf/archive/2008/bal092008.dat.gz.
[9] GeoIP City. Available: http://www.maxmind.com/app/city.